-
Notifications
You must be signed in to change notification settings - Fork 323
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add third data quality metric #11939
base: develop
Are you sure you want to change the base?
Conversation
…nto wip/mk/dq-metric
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some tweaks for efficiency but like the approach.
import org.enso.table.data.table.Column; | ||
import org.graalvm.polyglot.Context; | ||
|
||
public class CountWhitespace { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's create a base class (something like SampledOperation
), which would hold the RANDOM_SEED
and DEFAULT_SAMPLE_SIZE
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CountWhitespace
feels the wrong name, probably should be CountNonTrivialWhitespace
.
private Future<Long> untrimmedCount; | ||
private Future<Long> whitespaceCount; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets create a record type here.
record DataQualityMetrics(Long untrimmedCount, Long whitespaceCount);
We can then do it as sinlge CompletableFuture to compute the values.
List<String> trivialWhiteSpaceList = | ||
List.of( | ||
"\u200A", "\u200B", "\u205F", "\u2004", "\u2005", "\u2006", "\u2008", "\u2009", | ||
"\u2007", "\r", "\n", "\t", "\u2002", "\u00A0", "\u3000", "\u2003"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Set<Char>
should work here I think.
The variable name makes me think it is good not bad.
We can then loop over the characters of the String
and return true if the set contains any of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was also confused by the variable name, shouldn't it be nonTrivialWhiteSpace
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I imagine the JVM is good at optimizing such things, but perhaps we will make it easier for it if we make the list (or Set) a private static final
field ensuring that it is computed only once upon initialization and not on every invocation of this function (since this function may be invoked millions of times).
I imagine the JIT would fold this constant at some point, but making this a constant from the beginning maybe would just be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other thing - how do we know that this list is comprehensive?
Checking e.g. https://jkorpela.fi/chars/spaces.html suggests that we are missing e.g. the NARROW NO-BREAK SPACE U+202F
or U+180E
- the "MONGOLIAN VOWEL SEPARATOR" 😅
I think that as long as we can reasonably avoid it, I'd rather be against including such lists of constants in our code - it is easy for them to miss some obscure constants, and they are a bit less future proof. If instead we rely on ICU4J or the Java SDK itself, if there are new Unicode versions released in the future we'd just need to rely on them to get updated to the latest version instead of having to update many of such constant lists manually (which we will probably never do unless users start reporting bugs).
Of course things like "MONGOLIAN VOWEL SEPARATOR" are probably not that commonly used in most cases and missing them probably won't be a huge deal for a generic metric. But if we can use existing dependencies that we rely on anyway, I'd suggest preferring that.
Thus we could rewrite the code to essentially:
- iterate over every character in the text,
- check if it is any whitespace using
UCharacter::isUWhiteSpace
, - if it is whitespace - then check if it is 'non-trivial' - a non trivial whitespace is any whitespace that is not the
" "
character.
for (...) {
char c = str.charAt(i);
if (UCharacter::isUWhiteSpace(c)) {
boolean isNonTrivial = c != ' ';
if (isNonTrivial) { return true; }
}
context.safepoint();
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sorry small correction - after further reading perhaps the two examples I provided above may actually not be classified as whitespace by the Unicode standard... So perhaps the List was good.
But I think we can still simplify the code by relying on isUWhiteSpace
. It will also likely be much faster as it relies on bit-fiddling and boils down to 1 bit check per character instead of 16 independent text searches.
number_untrimmed = case all_rows_count > Column.default_sample_size of | ||
False -> JS_Object.from_pairs [["name", "Count untrimmed whitespace"], ["percentage_value", columns.map .count_untrimmed]] | ||
True -> JS_Object.from_pairs [["name", "Count untrimmed whitespace (sampled)"], ["percentage_value", columns.map .count_untrimmed]] | ||
[number_nothing, number_untrimmed] | ||
number_non_triv = case all_rows_count > Column.default_sample_size of | ||
False -> JS_Object.from_pairs [["name", "Count non trivial whitespace"], ["percentage_value", columns.map .count_non_trivial_whitespace]] | ||
True -> JS_Object.from_pairs [["name", "Count non trivial whitespace (sampled)"], ["percentage_value", columns.map .count_non_trivial_whitespace]] | ||
JS_Object.from_pairs | ||
[number_nothing, number_untrimmed, number_non_triv] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's change it so we check the count only once. Tiny nit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name_extra = if all_rows_count > Column.default_sample_size then " (sampled)" else ""
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it would be much better if we go the name_extra
route and avoid the duplication in all the percentage_value
entries.
Used for data quality indicator in Table Viz. | ||
count_non_trivial_whitespace : Integer -> Integer | Nothing | ||
count_non_trivial_whitespace self sample_size:Integer=Column.default_sample_size = | ||
if (self.value_type == Value_Type.Mixed || self.value_type.is_text).not then Nothing else |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (self.value_type == Value_Type.Mixed || self.value_type.is_text).not then Nothing else | |
if self.value_type.is_text.not then Nothing else |
I think Mixed implies is_text.not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No but this suggestion is not equivalent:
(self.value_type == Value_Type.Mixed || self.value_type.is_text).not
===
self.value_type != Value_Type.Mixed && self.value_type.is_text.not
!==
self.value_type.is_text.not
The counter-example is a mixed type column - self.value_type == Value_Type.Mixed
.
Then the first and the equivalent second expression will evaluate to False
but your suggested (third) expression evaluates to True
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it becomes clearer if the condition is named:
if (self.value_type == Value_Type.Mixed || self.value_type.is_text).not then Nothing else | |
is_eligible_for_whitespace_count = self.value_type == Value_Type.Mixed || self.value_type.is_text | |
if is_eligible_for_whitespace_count.not then Nothing else |
number_untrimmed = case all_rows_count > Column.default_sample_size of | ||
False -> JS_Object.from_pairs [["name", "Count untrimmed whitespace"], ["percentage_value", columns.map .count_untrimmed]] | ||
True -> JS_Object.from_pairs [["name", "Count untrimmed whitespace (sampled)"], ["percentage_value", columns.map .count_untrimmed]] | ||
[number_nothing, number_untrimmed] | ||
number_non_triv = case all_rows_count > Column.default_sample_size of | ||
False -> JS_Object.from_pairs [["name", "Count non trivial whitespace"], ["percentage_value", columns.map .count_non_trivial_whitespace]] | ||
True -> JS_Object.from_pairs [["name", "Count non trivial whitespace (sampled)"], ["percentage_value", columns.map .count_non_trivial_whitespace]] | ||
JS_Object.from_pairs | ||
[number_nothing, number_untrimmed, number_non_triv] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name_extra = if all_rows_count > Column.default_sample_size then " (sampled)" else ""
Pull Request Description
Important Notes
Checklist
Please ensure that the following checklist has been satisfied before submitting the PR:
Scala,
Java,
TypeScript,
and
Rust
style guides. In case you are using a language not listed above, follow the Rust style guide.
or the Snowflake database integration, a run of the Extra Tests has been scheduled.